The harmonic mean ''p''-value (HMP) is a statistical technique for addressing the multiple comparisons problem that controls the strong-sense family-wise error rate (this claim has been disputed). It improves on the power of the Bonferroni correction by performing combined tests, i.e. by testing whether ''groups'' of ''p''-values are statistically significant, like Fisher's method. However, it avoids the restrictive assumption that the ''p''-values are independent, unlike Fisher's method. Consequently, it controls the false positive rate when tests are dependent, at the expense of less power (i.e. a higher false negative rate) when tests are independent. Besides providing an alternative to approaches such as the Bonferroni correction that control the stringent family-wise error rate, it also provides an alternative to the widely used Benjamini-Hochberg procedure (BH) for controlling the less-stringent false discovery rate. This is because the power of the HMP to detect significant ''groups'' of hypotheses is greater than the power of BH to detect significant ''individual'' hypotheses.

There are two versions of the technique: (i) direct interpretation of the HMP as an approximate ''p''-value and (ii) a procedure for transforming the HMP into an asymptotically exact ''p''-value. The approach provides a multilevel test procedure in which the smallest groups of ''p''-values that are statistically significant may be sought.

Direct interpretation of the harmonic mean ''p''-value

The weighted harmonic mean of ''p''-values  $p_1, \dots, p_L$  is defined as $\overset{\circ}{p} = \frac{\sum_{i=1}^L w_i}{\sum_{i=1}^L w_i/p_i},$ where  $w_1, \dots, w_L$  are weights that must sum to one, i.e.  $\sum_{i=1}^L w_i=1$ . Equal weights may be chosen, in which case  $w_i=1/L$ .

In general, interpreting the HMP directly as a ''p''-value is anti-conservative, meaning that the false positive rate is higher than expected. However, as the HMP becomes smaller, under certain assumptions, the discrepancy decreases, so that direct interpretation of significance achieves a false positive rate close to that implied for sufficiently small values (e.g.  $\overset{\circ}{p}<0.05$ ).

The HMP is never anti-conservative by more than a factor of  $e\,\log L$  for small  $L$ , or  $\log L$  for large  $L$ . However, these bounds represent worst-case scenarios under arbitrary dependence that are likely to be conservative in practice. Rather than applying these bounds, asymptotically exact ''p''-values can be produced by transforming the HMP.
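As a minimal sketch of the definition, the weighted HMP can be computed in a few lines of Python. The function name `harmonic_mean_p` and its input checks are illustrative assumptions and are not taken from any published package.

```python
def harmonic_mean_p(p_values, weights=None):
    """Weighted harmonic mean of p-values (hypothetical helper name)."""
    L = len(p_values)
    if L == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        # Equal weights w_i = 1/L.
        weights = [1.0 / L] * L
    if len(weights) != L:
        raise ValueError("one weight per p-value is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # HMP = (sum of w_i) / (sum of w_i / p_i); the numerator equals one here.
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))


print(harmonic_mean_p([0.01, 0.5]))
```

With ''p''-values 0.01 and 0.5 and equal weights, the HMP is 1/51, or about 0.0196, which illustrates that the result is dominated by the smallest ''p''-value.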

Asymptotically exact harmonic mean ''p''-value procedure

The generalized central limit theorem shows that an asymptotically exact ''p''-value,  $p_{\overset{\circ}{p}}$ , can be computed from the HMP,  $\overset{\circ}{p}$ , using the formula $p_{\overset{\circ}{p}} = \int_{1/\overset{\circ}{p}}^\infty f_\textrm{Landau}\left(x\,|\,\log L+0.874,\frac{\pi}{2}\right) \mathrm{d} x.$ Subject to the assumptions of the generalized central limit theorem, this transformed ''p''-value becomes exact as the number of tests,  $L$ , becomes large. The computation uses the Landau distribution, whose density function can be written $f_\textrm{Landau}(x\,|\,\mu,\sigma) = \frac{1}{\pi\sigma}\int_0^\infty \textrm{e}^{-t\frac{(x-\mu)}{\sigma} -\frac{2}{\pi}t \log t}\,\sin(2t)\,\textrm{d}t.$

The test is implemented by the p.hmp command of the harmonicmeanp R package; a tutorial is available online.

Equivalently, one can compare the HMP to a table of critical values (Table 1). The table illustrates that the smaller the false positive rate, and the smaller the number of tests, the closer the critical value is to the false positive rate. 
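The transformation can be sketched numerically without a statistics library. Integrating the Landau density over  $x$  from  $1/\overset{\circ}{p}$  to infinity before integrating over  $t$  leaves a single integral, $p_{\overset{\circ}{p}} = \frac{1}{\pi}\int_0^\infty \textrm{e}^{-t\frac{(1/\overset{\circ}{p}-\mu)}{\sigma} -\frac{2}{\pi}t \log t}\,\frac{\sin(2t)}{t}\,\textrm{d}t,$ with  $\mu=\log L+0.874$  and  $\sigma=\pi/2$ . This rearrangement, the function name `hmp_asymptotic_p`, the Simpson's-rule settings and the restriction to  $1/\overset{\circ}{p}>\mu$  are assumptions made for illustration; the sketch does not reproduce the implementation in the R package.

```python
import math


def hmp_asymptotic_p(hmp, L, n_steps=20000, u_max=50.0):
    """Landau tail area above 1/hmp (hypothetical helper, small HMPs only)."""
    if n_steps % 2:
        raise ValueError("n_steps must be even for Simpson's rule")
    mu = math.log(L) + 0.874
    sigma = math.pi / 2.0
    rate = (1.0 / hmp - mu) / sigma
    if rate <= 0.0:
        # Assumption of this sketch: outside this regime the integrand
        # oscillates with large amplitude and the quadrature is unreliable.
        raise ValueError("sketch requires 1/hmp > log(L) + 0.874")

    def integrand(u):
        # Substituting t = u / rate turns exp(-rate * t) into exp(-u).
        if u == 0.0:
            return 2.0 / rate  # limit of sin(2 * u / rate) / u as u -> 0
        t = u / rate
        damping = math.exp(-u - (2.0 / math.pi) * t * math.log(t))
        return damping * math.sin(2.0 * t) / u

    # Composite Simpson's rule on [0, u_max]; the integrand decays like exp(-u).
    h = u_max / n_steps
    total = integrand(0.0) + integrand(u_max)
    for k in range(1, n_steps):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    return (h / 3.0) * total / math.pi


print(hmp_asymptotic_p(0.04, 10))
```

For small HMPs the returned value is somewhat larger than the HMP itself, in line with direct interpretation being mildly anti-conservative.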


  Multiple testing via the multilevel test procedure 

If the HMP is significant at some level  $\alpha$  for a group of  $L$  ''p''-values, one may search all subsets of the  $L$  ''p''-values for the smallest significant group, while maintaining the strong-sense family-wise error rate. Formally, this constitutes a  closed-testing procedure.

When  $\alpha$  is small (e.g.  $\alpha<0.05$ ), the following multilevel test based on direct interpretation of the HMP controls the strong-sense family-wise error rate at level approximately  $\alpha:$ 

# Define the HMP of any subset  $\mathcal{R}$  of the  $L$  ''p''-values to be $\overset{\circ}{p}_\mathcal{R} = \frac{\sum_{i\in\mathcal{R}} w_i}{\sum_{i\in\mathcal{R}} w_i/p_i}.$
# Reject the null hypothesis that none of the ''p''-values in subset  $\mathcal{R}$  are significant if  $\overset{\circ}{p}_\mathcal{R}\leq\alpha\,w_\mathcal{R}$ , where  $w_\mathcal{R}=\sum_{i\in\mathcal{R}}w_i$ . (Recall that, by definition,  $\sum_{i=1}^L w_i=1$ .)



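An asymptotically exact version of the above replaces  $\overset{\circ}{p}_\mathcal{R}$  in step 2 with  $p_{\overset{\circ}{p}_\mathcal{R}} = \max\left\{\overset{\circ}{p}_\mathcal{R}, w_\mathcal{R} \int_{w_\mathcal{R}/\overset{\circ}{p}_\mathcal{R}}^\infty f_\textrm{Landau}\left(x\,|\,\log L+0.874,\frac{\pi}{2}\right) \mathrm{d} x\right\},$  where  $L$  gives the number of ''p''-values, not just those in subset  $\mathcal{R}$ .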

Since direct interpretation of the HMP is faster, a two-pass procedure may be used to identify subsets of ''p''-values that are likely to be significant using direct interpretation, subject to confirmation using the asymptotically exact formula.
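The directly interpreted multilevel test can be sketched as follows. The helper names `subset_hmp` and `reject_subset`, and the representation of a subset as a list of indices, are illustrative assumptions.

```python
def subset_hmp(p_values, weights, subset):
    """Return (HMP of the subset, total weight of the subset)."""
    w_R = sum(weights[i] for i in subset)
    hmp_R = w_R / sum(weights[i] / p_values[i] for i in subset)
    return hmp_R, w_R


def reject_subset(p_values, weights, subset, alpha=0.05):
    """Step 2 of the procedure: reject if the subset HMP <= alpha * w_R."""
    hmp_R, w_R = subset_hmp(p_values, weights, subset)
    return hmp_R <= alpha * w_R


p = [0.001, 0.02, 0.3, 0.8]
w = [0.25, 0.25, 0.25, 0.25]   # weights over all L = 4 tests sum to one
print(reject_subset(p, w, [0, 1, 2, 3]))  # all four p-values combined
print(reject_subset(p, w, [0]))           # smallest p-value on its own
print(reject_subset(p, w, [1]))           # second p-value on its own
```

In this example the full group and the singleton containing 0.001 are rejected at  $\alpha=0.05$ , whereas 0.02 on its own is not, because a single ''p''-value must satisfy  $p_i\leq\alpha\,w_i=0.0125$ .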

  Properties of the HMP 

The HMP has a range of properties that arise from the generalized central limit theorem. It is:

* Robust to positive dependency between the ''p''-values. 
* Insensitive to the exact number of tests, ''L''. 
* Robust to the distribution of weights, ''w''. 
* Most influenced by the smallest ''p''-values.

When the HMP is not significant, neither is any subset of the constituent tests. Conversely, when the multilevel test deems a subset of ''p''-values to be significant, the HMP for all the ''p''-values combined is likely to be significant; this is certain when the HMP is interpreted directly. When the goal is to assess the significance of ''individual'' ''p''-values, so that combined tests concerning ''groups'' of ''p''-values are of no interest, the HMP is equivalent to the Bonferroni procedure but subject to the more stringent significance threshold  $\alpha_L<\alpha$  (Table 1).

The HMP assumes the individual ''p''-values have (not necessarily independent)  standard uniform distributions when their null hypotheses are true. Large numbers of underpowered tests can therefore harm the power of the HMP.

While the choice of weights is unimportant for the validity of the HMP under the null hypothesis, the weights influence the power of the procedure. Supplementary Methods §5C of the original paper and an online tutorial consider the issue in more detail.

  Bayesian interpretations of the HMP 

The HMP was conceived by analogy to Bayesian model averaging and can be interpreted as inversely proportional to a model-averaged Bayes factor when combining ''p''-values from likelihood ratio tests.

  The harmonic mean rule-of-thumb 

I. J. Good reported an empirical relationship between the Bayes factor and the ''p''-value from a likelihood ratio test. For a null hypothesis  $H_0$  nested in a more general alternative hypothesis  $H_A,$  he observed that often, $\textrm{BF}_i\approx \frac{1}{\gamma\,p_i},\quad3\frac{1}{3}<\gamma<30,$  where  $\textrm{BF}_i$  denotes the Bayes factor in favour of  $H_A$  versus  $H_0.$  Extrapolating, he proposed a rule of thumb in which the HMP is taken to be inversely proportional to the model-averaged Bayes factor for a collection of  $L$  tests with common null hypothesis: $\overline{\textrm{BF}}=\sum_{i=1}^L w_i\,\textrm{BF}_i \approx \sum_{i=1}^L \frac{w_i}{\gamma\,p_i} = \frac{1}{\gamma\,\overset{\circ}{p}}.$ For Good, his rule-of-thumb supported an interchangeability between Bayesian and classical approaches to hypothesis testing.

  Bayesian calibration of ''p''-values 

If the distributions of the ''p''-values under the alternative hypotheses follow Beta distributions with parameters  $\left(0<\xi_i<1, 1\right)$ , a form considered by Sellke, Bayarri and Berger, then the inverse proportionality between the model-averaged Bayes factor and the HMP can be formalized as $\overline{\textrm{BF}}=\sum_{i=1}^L \mu_i\,\textrm{BF}_i=\sum_{i=1}^L \mu_i\,\xi_i\,p_i^{\xi_i-1}\approx\bar\xi\sum_{i=1}^L w_i\,p_i^{-1}=\frac{\bar\xi}{\overset{\circ}{p}},$  where

* $\mu_i$  is the prior probability of alternative hypothesis  $i,$  such that  $\sum_{i=1}^L\mu_i=1,$ 
* $\xi_i/(1+\xi_i)$  is the expected value of  $p_i$  under alternative hypothesis  $i,$ 
* $w_i=u_i/\bar\xi$  is the weight attributed to ''p''-value  $i,$ 
* $u_i = \left(\mu_i\,\xi_i\right)^{1/(1-\xi_i)}$  incorporates the prior model probabilities and powers into the weights, and
* $\bar\xi = \sum_{i=1}^L u_i$  normalizes the weights.

The approximation works best for well-powered tests ( $\xi_i\ll 1$ ).
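The quality of the approximation can be checked numerically. The sketch below assumes Bayes factors of the form  $\xi_i\,p_i^{\xi_i-1}$  (the Beta density with parameters  $(\xi_i, 1)$  evaluated at  $p_i$ ) and weights built from  $u_i=(\mu_i\,\xi_i)^{1/(1-\xi_i)}$ ; the function names and example values are hypothetical.

```python
def model_averaged_bf_exact(p_values, mu, xi):
    """Sum of mu_i * xi_i * p_i**(xi_i - 1) over all tests."""
    return sum(m * x * p ** (x - 1.0) for p, m, x in zip(p_values, mu, xi))


def model_averaged_bf_via_hmp(p_values, mu, xi):
    """Approximation xi_bar / HMP using the calibrated weights w_i."""
    u = [(m * x) ** (1.0 / (1.0 - x)) for m, x in zip(mu, xi)]
    xi_bar = sum(u)                      # normalizing constant
    w = [u_i / xi_bar for u_i in u]      # weights sum to one
    hmp = 1.0 / sum(w_i / p for w_i, p in zip(w, p_values))
    return xi_bar / hmp


p = [0.01, 0.2]
mu = [0.5, 0.5]      # prior probabilities of the two alternatives
xi = [0.01, 0.01]    # well-powered tests: xi_i much smaller than one
exact = model_averaged_bf_exact(p, mu, xi)
approx = model_averaged_bf_via_hmp(p, mu, xi)
print(exact, approx)
```

With these well-powered tests the two quantities agree to within about one percent; the agreement deteriorates as the  $\xi_i$  approach one.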

  The harmonic mean ''p''-value as a bound on the Bayes factor 

For likelihood ratio tests with exactly two degrees of freedom, Wilks' theorem implies that  $p_i=1/R_i$ , where  $R_i$  is the maximized likelihood ratio in favour of alternative hypothesis  $i,$  and therefore  $\overset{\circ}{p}=1/\bar{R}$ , where  $\bar{R}$  is the weighted mean maximized likelihood ratio, using weights  $w_1,\dots,w_L.$  Since  $R_i$  is an upper bound on the Bayes factor,  $\textrm{BF}_i$ , then  $1/\overset{\circ}{p}$  is an upper bound on the model-averaged Bayes factor: $\overline{\textrm{BF}}\leq\frac{1}{\overset{\circ}{p}}.$ While the equivalence holds only for two degrees of freedom, the relationship between  $\overset{\circ}{p}$  and  $\bar{R},$  and therefore  $\overline{\textrm{BF}},$  behaves similarly for other degrees of freedom.
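The two-degrees-of-freedom identity follows because the chi-squared distribution with two degrees of freedom has survival function  $\textrm{e}^{-x/2}$ , so a test statistic  $2\log R_i$  gives  $p_i=\textrm{e}^{-\log R_i}=1/R_i$ . The short sketch below illustrates this with hypothetical likelihood ratios; the function name is an assumption.

```python
import math


def p_from_likelihood_ratio(R):
    """p-value of a 2-d.f. likelihood ratio test with maximized ratio R >= 1."""
    statistic = 2.0 * math.log(R)        # Wilks' test statistic
    return math.exp(-statistic / 2.0)    # chi-squared survival function, 2 d.f.


R = [20.0, 4.0]          # hypothetical maximized likelihood ratios
w = [0.5, 0.5]           # weights summing to one
p = [p_from_likelihood_ratio(r) for r in R]
hmp = 1.0 / sum(w_i / p_i for w_i, p_i in zip(w, p))
R_bar = sum(w_i * r for w_i, r in zip(w, R))
print(hmp, 1.0 / R_bar)
```

Here the ''p''-values are 1/20 and 1/4, the weighted mean likelihood ratio is 12, and the HMP equals 1/12.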

Under the assumption that the distributions of the ''p''-values under the alternative hypotheses follow Beta distributions with parameters  $\left(1, \kappa_i>1\right),$  and that the weights  $w_i=\mu_i,$  the HMP provides a tighter upper bound on the model-averaged Bayes factor: $\overline{\textrm{BF}}\leq \frac{1}{e\,\overset{\circ}{p}},$ a result that again reproduces the inverse proportionality of Good's empirical relationship (Held 2019).

  References 

* Held, L. (2019). "On the Bayesian interpretation of the harmonic mean ''p''-value". Proceedings of the National Academy of Sciences USA. 116 (13): 5855–5856. doi:10.1073/pnas.1900671116. PMC 6442579. PMID 30890644.


